Less is more? How demographic sample weights can improve public opinion estimates based on Twitter data
Abstract
An important limitation in previous studies of political behavior using Twitter data is the lack of information about the sociodemographic characteristics of individual users. This paper addresses this challenge by developing new machine learning methods that allow researchers to estimate the age, gender, race, party affiliation, propensity to vote, and income of any Twitter user in the U.S. with high accuracy. The training dataset for these classifiers was obtained by matching a massive dataset of 1 billion geolocated Twitter messages with voter registration records and estimates of home values across 15 different states, resulting in a sample of nearly 250,000 Twitter users whose sociodemographic traits are known. I illustrate the value of these new methods with two applications. First, I explore how attention to different candidates in the 2016 presidential primary election varies across demographic groups within a panel of randomly selected Twitter users. I argue that these covariates can be used to adjust estimates of sentiment towards political actors based on Twitter data, and provide a proof of concept using presidential approval. Second, I examine whether social media can reduce inequalities in potential exposure to political messages. In particular, I show that retweets (a proxy for inadvertent exposure) have a large equalizing effect on access to information.

Working paper. This version: February 24, 2015. The author gratefully acknowledges financial support from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation.

Twitter data is widely acknowledged to hold great promise for the study of social and political behavior (Mejova, Weber, and Macy 2015; Jungherr 2015). In a context of plummeting survey response rates, tweets represent unfiltered expressions of political opinions, which have been found to be correlated with offline opinions and behavior (O'Connor et al. 2010; DiGrazia et al. 2013; Vaccari et al. 2013). More generally, Twitter data also allows researchers to easily and unobtrusively observe social interactions in real time, and to measure consumption of political information with a level of granularity that could only be achieved in the past at great cost.

Despite the great promise of this source of data, an important challenge that remains to be overcome is the lack of sociodemographic information about Twitter users. Unlike other social media platforms, Twitter does not require its users to provide basic information about themselves, such as gender or age. As a result, researchers interested in working with Twitter data cannot construct survey weights to recover the representativeness of their samples in the same way that survey researchers combine probability sampling with post-stratification weights to reduce sampling selection bias (Schober et al. 2016). Beyond this methodological concern, the availability of individual-level covariates would expand the range of questions that can be studied with Twitter data. For example, if we were interested in measuring support for political candidates in a primary election, it would allow us to subset only those users who are affiliated with that party. Being able to identify income, gender, and race would enable studies of social segregation in online settings. We could also study social inequalities in political behavior at a much more granular level if we were able to observe the individual traits of Twitter users.
The contribution of this paper is to develop new methods to estimate the age, gender, race, party affiliation, propensity to vote, and income of any Twitter user in the U.S. This work improves upon previous studies of latent attribute inference based on Twitter data (Al Zamal, Liu, and Ruths 2012; Chen et al. 2015; Mislove et al. 2011; Pennacchiotti and Popescu 2011; Rao et al. 2010) in two ways. First, by relying on a ground-truth dataset at least two orders of magnitude larger than those used in previous studies, this method achieves significantly better performance. Second, and most importantly, the features used to predict Twitter users' latent traits can be measured using no more than 5 API calls per user, which makes the method easy to scale to large datasets.

This paper also provides two applications of these methods to questions of substantive interest. First, I examine how attention to different candidates in the 2016 presidential primary election varies across sociodemographic groups using a panel of 200,000 randomly selected users. This panel design overcomes some of the difficulties inherent to self-selection bias and, in combination with the sampling weights that can now be computed using the latent traits estimated with the method introduced here, could potentially allow researchers to recover the representativeness of estimates based on Twitter data. Second, I examine how exposure to political information on Twitter varies across sociodemographic groups. Merging data about who retweeted particular political messages with the lists of accounts that each individual in this panel of 200,000 users follows, I am able to quantify direct exposure (via following) and indirect exposure (via retweets). This analysis shows that even if direct exposure is highly unequal across social groups, the differences are significantly reduced once inadvertent exposure is considered. This result highlights the potential of social media to reduce inequalities in access to political information.

Background and Related Work

Previous studies have approached the problem of estimating the sociodemographic characteristics of Twitter users in one of two ways. One option is to apply supervised machine learning methods to a training dataset of users whose traits are known, usually through human coding. For example, Chen et al. (2015) used Amazon Mechanical Turk to label the ethnicity, gender, and age of 2,000 users, and then ran different classifiers using features from users' tweets, their neighbors, and their profile pictures. Pennacchiotti and Popescu (2011) employed a similar method with a sample of 6,000 users who stated their ethnicity in their profile descriptions, and 10,000 users who added themselves to a public directory of Democrats and Republicans on Twitter. Al Zamal, Liu, and Ruths (2012) used the same source for political orientation, and 400 tweets from users announcing their own birthdays to identify age. They considered a similar set of features: information about users' tweets as well as about their friends and followers. A second approach is to rely on indirect methods, such as extracting Twitter users' names and comparing them with existing datasets of gender distributions by first name and of ethnicity distributions by last name, in order to compute the probability of a user being male or female, and Caucasian, African-American, etc. (Mislove et al. 2011).
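To make the name-based approach concrete, here is a minimal sketch of gender inference from first names. It is my own illustration, not code from any of the studies cited: it assumes a hypothetical file ssa_names.csv with columns name, sex, and count, in the spirit of the Social Security Administration baby names data discussed below.

```python
import pandas as pd

# Hypothetical input: one row per (name, sex) with total birth counts.
ssa = pd.read_csv("ssa_names.csv")  # assumed columns: name, sex ('F'/'M'), count
pivot = ssa.pivot_table(index="name", columns="sex", values="count", aggfunc="sum").fillna(0)
pivot["p_female"] = pivot["F"] / (pivot["F"] + pivot["M"])

def infer_gender(first_name: str, threshold: float = 0.9):
    """Return 'female'/'male' for strongly gendered names, None otherwise."""
    name = first_name.strip().capitalize()
    if name not in pivot.index:
        return None  # name absent from the list: the coverage gap discussed next
    p = pivot.loc[name, "p_female"]
    if p >= threshold:
        return "female"
    if p <= 1 - threshold:
        return "male"
    return None  # ambiguous name, used for both sexes
```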
A different type of indirect approach was used by Culotta et al. (2015): using website audience data, they show that followers of the Twitter accounts of these websites have a similar demographic composition. Within this category we would also find unsupervised methods that detect latent communities based on interactions on Twitter, building upon the assumption that behavior is homophilic (Conover et al. 2012; Barberá 2015a).

Both approaches have limitations that make it difficult to scale these methods to large samples of users. Indirect methods do not perform well for sociodemographic traits that are not heavily correlated with behavior, and name-based methods cannot be applied when names are not included in the lists of names tagged by gender, which limits their applicability. For example, nearly 110,000 of 250,000 (44%) randomly selected U.S. Twitter users (see Applications section) did not report a first name that appears in the Social Security Administration baby names dataset (Blevins and Mullen 2015). Supervised methods do not suffer from this problem but, because of their use of small, self-selected samples, they require collecting "costly" features in order to achieve high accuracy. Measuring features such as the text of a user's neighbors (her followers and those she follows) is very time-consuming because it requires hundreds of API calls, making this approach impractical for any sample larger than a few thousand users.

The aim of this paper is to develop a new approach that overcomes these limitations and allows any researcher to (1) estimate the age, gender, race/ethnicity, income, propensity to vote, and party affiliation of (2) any Twitter user, (3) with fewer than 5 API calls per user.

Method

Even if the sociodemographic characteristics of Twitter users cannot be directly observed, there are at least two different types of information that researchers can use to infer them.

Text of users' tweets. A range of previous studies have shown significant differences in language use between men and women (Newman et al. 2008), liberals and conservatives (Sylwester and Purver 2015), and individuals of different age (Schwartz et al. 2013) and race groups (Florini 2013). Language use reflects not only differences in personality or opinions, but also in interests and activities, which may also be correlated with users' sociodemographic characteristics. Text in microblogging platforms such as Twitter often includes emoji characters: ideograms depicting facial expressions, objects, and flags, among others, which can often convey more complex ideas than single words. To test whether language use predicts users' latent traits, I estimate two models of increasing complexity: first, a logistic classifier with Elastic Net regularization (Zou and Hastie 2005) using only emoji characters as features (bag-of-emoji); second, a logistic classifier with Elastic Net regularization and Stochastic Gradient Descent (SGD) learning using word counts as features (bag-of-words, BOW), after applying a TF-IDF transformation. To reduce the size of the feature matrix, I only consider emoji and words used by more than 1% and less than 90% of the users in the training dataset (627 emoji characters, 34,092 unigrams).
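As a rough illustration of the bag-of-words model just described (a sketch under assumptions, not the paper's released code), scikit-learn's SGDClassifier with an Elastic Net penalty can play the role of the regularized logistic classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-ins: one document per user (all their tweets concatenated) and labels.
docs = ["love this wedding omg", "game tonight bro", "so excited yay", "good game man"]
labels = ["female", "male", "female", "male"]

model = make_pipeline(
    # Keep tokens used by >1% and <90% of users, then apply TF-IDF weighting.
    TfidfVectorizer(min_df=0.01, max_df=0.90),
    # Logistic loss + Elastic Net penalty, fit by stochastic gradient descent.
    # (loss="log_loss" in scikit-learn >= 1.1; earlier versions use loss="log".)
    SGDClassifier(loss="log_loss", penalty="elasticnet", l1_ratio=0.15, alpha=1e-4),
)
model.fit(docs, labels)
print(model.predict(["omg so excited"]))
```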
Users' friends. Previous studies have systematically found that the characteristics of users' neighbors (the accounts they decide to follow) are highly correlated with their own characteristics (Chen et al. 2015; Al Zamal, Liu, and Ruths 2012). This result is consistent with the strong homophilic patterns commonly found in social networks (McPherson, Smith-Lovin, and Cook 2001). However, collecting information about the entire network of a given user is costly, often requiring multiple API calls. Instead, the approach I propose here is to focus on which verified accounts users decide to follow, and to use this information to predict their latent traits. (Verification is granted by Twitter to public figures, including celebrities, media outlets, and politicians, in order to certify that their profile corresponds to their real identity. The full list of verified accounts is publicly available at http://twitter.com/verified.) If we consider Twitter as a news medium (Kwak, Moon, and Lee 2012), these following decisions can also be informative about users' interests and preferences. Of the over 154,000 accounts currently verified, I select only the 61,659 accounts with more than 10,000 followers and English or Spanish as their account language. Similar to a row of an adjacency matrix, the set of features for each individual is a vector of length 61,659 with value 1 if the user follows that particular account and 0 otherwise (bag-of-followers). As in the previous case, I estimate a logistic classifier with Elastic Net regularization and SGD learning to predict users' traits.
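A minimal sketch of how the bag-of-followers feature matrix could be assembled (my own construction, with toy stand-ins for the real friend lists and the screened set of 61,659 verified accounts):

```python
import numpy as np
from scipy.sparse import lil_matrix

# Toy stand-ins: the screened verified accounts and each user's friend list.
verified_ids = [101, 202, 303]                       # in practice, 61,659 account IDs
col = {acct: j for j, acct in enumerate(verified_ids)}
friends = {"user_a": {101, 303, 999}, "user_b": {202}}

# One row per user, one column per verified account; 1 = "user follows account".
X = lil_matrix((len(friends), len(verified_ids)), dtype=np.int8)
for i, followed in enumerate(friends.values()):
    for acct in followed & col.keys():               # ignore non-verified accounts (999)
        X[i, col[acct]] = 1
X = X.tocsr()  # compressed sparse rows: an efficient format for SGD training
```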
These two are not the only possible sources of information about users' characteristics. Twitter allows users to write a 140-character description of themselves in their profile, and this text has been used in previous studies to build training datasets (Pennacchiotti and Popescu 2011). As discussed in the previous section, first and last names also contain information about individuals' gender and ethnicity. However, even if in some cases these methods could lead to more accurate predictions, they are limited by the sparsity of the data: many users do not fill in the 'description' field or do not report a name contained in the existing name datasets.

Data

Geolocated tweets

The first step in the data collection process was to construct a list of U.S. Twitter users whose location is known with county-level granularity. To do so, I collected a random sample of 1.1 billion geolocated tweets from around the world between July 2013 and May 2014. Of these, nearly 250 million tweets from 4.4 million unique users were sent from the contiguous United States. The pair of coordinates (longitude and latitude) in each tweet was then used to identify the county and zipcode from which it was sent, using the shape files that delimit the polygons of each of these geographical units. The 'name' field in users' profiles was also extracted from all the tweets in this dataset, and parsed using regular expressions to split it into first, middle, and last name. These two sources of information, geographic (county and zipcode) and name (first and last), are used to match Twitter accounts with publicly available voting records.

Voting Registration Records

The availability of voting registration records varies across states, depending on the rules imposed by their Secretaries of State. In most cases, they are freely available upon request or after paying a small fee. These files generally contain the full name, residential address, party affiliation, gender, race, and past vote history of all voters who have ever registered to vote. In this project, I use voting records from 15 states: Arkansas, California, Colorado, Connecticut, Delaware, Florida, Michigan, Nevada, North Carolina, Ohio, Oklahoma, Pennsylvania, Rhode Island, Utah, and Washington. While this set of states was chosen for convenience (in all 15 states the voter records can be easily obtained online), it presents significant variation in electoral outcomes, population, and region. The voting records from each of these states were parsed and standardized to a common file format in order to facilitate the matching process. All the code necessary to run this step is available at github.com/pablobarbera/voter-files.

Matching Process

A given Twitter account was matched with a voter only when there was a perfect and unique match of first name, last name, county, and state. In cases of multiple Twitter accounts or voters with identical first and last names in a county, they were matched at the zipcode level using the same method. This procedure is conservative on purpose: the goal is to create a training dataset with as little uncertainty as possible about users' true characteristics. More sophisticated methods, based on geographic distance, could also be implemented in future work; voters' residential addresses are available in all states, and these addresses could easily be parsed to coordinates.
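The matching rule can be sketched as follows. This is a reconstruction of the logic for illustration only; the released code at github.com/pablobarbera/voter-files is the authoritative version, and the column names here are assumptions.

```python
import pandas as pd

def unique_matches(left: pd.DataFrame, right: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Join on `keys`, keeping only key combinations unique on BOTH sides."""
    l = left[~left.duplicated(keys, keep=False)]
    r = right[~right.duplicated(keys, keep=False)]
    return l.merge(r, on=keys, how="inner")

# Toy stand-ins for the parsed Twitter profiles and standardized voter files.
twitter = pd.DataFrame({"first": ["ann", "bob"], "last": ["lee", "cox"],
                        "county": ["dade", "polk"], "state": ["FL", "FL"],
                        "zipcode": ["33101", "33801"]})
voters = pd.DataFrame({"first": ["ann", "bob"], "last": ["lee", "cox"],
                       "county": ["dade", "polk"], "state": ["FL", "FL"],
                       "zipcode": ["33101", "33801"], "party": ["DEM", "REP"]})

keys = ["first", "last", "county", "state"]
matched = unique_matches(twitter, voters, keys)
# Name/county ties on either side are retried at the finer zipcode level:
ties_t = twitter[twitter.duplicated(keys, keep=False)]
ties_v = voters[voters.duplicated(keys, keep=False)]
matched_zip = unique_matches(ties_t, ties_v, keys + ["zipcode"])
```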
Table 1 provides summary statistics for the sample sizes at each step. The first column indicates the total number of registered voters in each state; their sum corresponds to between 35% and 50% of all registered voters in the U.S., depending on how these are defined. The second column shows the number of Twitter users in each state, based on the dataset of geolocated tweets. The third and fourth columns show the total number of Twitter users matched using this method, and the proportion it represents of the total number of Twitter users in each state. This proportion ranges from 9.5% in North Carolina to 17.2% in Ohio. While these proportions may seem low, Bond et al. (2012) were only able to match around 33% of Facebook users to voter records, despite having access to users' birthdates on a much less anonymous social networking site, where users are less likely to use pseudonyms.

Table 1: Matching voting records and Twitter users.

State            Registered Voters   Twitter Users   Matches      %
Arkansas                 1,582,012          32,372     4,615   14.2
California              17,811,391         554,213    65,079   11.7
Colorado                 3,500,164          56,844     9,009   15.8
Connecticut              2,186,628          46,840     5,902   12.6
Delaware                   645,329          13,008     1,923   14.8
Florida                 13,037,192         260,604    36,308   13.9
Michigan                 7,425,020         118,919    17,710   14.9
Nevada                   1,438,967          57,069     6,724   11.8
North Carolina           5,413,637         127,463    14,292    9.5
Ohio                     7,507,994         162,993    28,047   17.2
Oklahoma                 1,983,727          48,780     6,746   13.9
Pennsylvania             8,231,634         168,873    21,537   12.7
Rhode Island               740,051          18,557     2,607   14.0
Utah                     1,481,505          31,862     3,536   11.1
Washington               4,339,309          65,565    11,226   17.1
Total                   77,324,560       1,763,962   233,132   13.2

Since the residential address at which each voter is registered is also publicly available, this dataset can be matched with home property records to obtain a rough estimate of each user's income. In particular, I queried the Zillow API for the 'zestimate' of each address: an estimate of the market value of each individual home, calculated for about 100 million homes in the U.S. based on public and user-submitted data points (more information is available at www.zillow.com/zestimate/). This quantity is then normalized by multiplying it by the ratio of the median home value in each state over the median home value in the U.S., in order to make values comparable across states. Despite this transformation, note that home values are still a noisy proxy for citizens' income: for example, I cannot distinguish whether a home is owned or rented. Despite these limitations, this variable provides a reasonable estimate of a given citizen's wealth.

The final step in the data collection process was to download the list of 'friends' of all 233,132 users matched with voting records, as well as their 1,000 most recent tweets. Since 99% of the users in this sample follow fewer than 25,000 accounts, it is possible to construct the feature matrix with fewer than 5 API calls per user (each API call can return 200 tweets or 5,000 friends). After excluding private and suspended Twitter accounts, the total size of the training dataset is 201,800 Twitter accounts.

Variables

After merging and cleaning all the datasets, the analysis focuses on six sociodemographic variables, recoded as follows:
• Gender: male or female.
• Age: 18-25, 26-40, 40+ (approximately three terciles of the age distribution of Twitter users in the sample).
• Race: African-American, Hispanic/Latino, Asian/Other, White.
• Party: Unaffiliated, Democrat, Republican.
• Vote: turnout in the 2012 presidential election.
• Income: normalized home value lower than $150,000, between $150,000 and $300,000, or greater than $300,000 (approximately three terciles of the home value distribution in the sample).

Results

Table 2 reports the performance of the classifiers for all sociodemographic characteristics. In order to examine the performance of each model, I provide as a baseline the proportion of individuals in the modal category of each variable (male, 40+, white, unaffiliated, voted in 2012, home value $150K-$300K), as well as the sample size included in the estimation. Accuracy was computed using 5-fold cross-validation. Note that the total sample size is lower for some variables because they are not available in all states, or for all individuals; for example, race is only available in Florida and North Carolina. In Table 3, I provide additional information about the performance of the two main classifiers, after disaggregating each variable into individual categories and computing accuracy, precision, and recall for each dichotomized indicator. I find that the performance of the classifiers is in all cases better than random guessing or choosing the modal category, with the exception of the bag-of-emoji models.

Table 2: Performance of machine learning classifiers (cross-validated accuracy, 5 folds)

                      Gend.   Age   Race  Party  Vote   Inc.
Baseline (mode)        51.2  37.2   67.6   38.4  63.0   42.7
N (users, 1000s)        130   202     40    174   196    159
Categories                2     3      4      3     2      3
Text classifiers
  Bag-of-emoji         69.2  52.0   68.9   40.3  65.3   43.2
  Bag-of-words         84.9  65.5   77.3   50.3  67.2   48.1
Network classifiers
  Bag-of-followers     85.3  63.1   77.6   50.7  64.2   45.7
Combined classifier
  BoE + BoW + BoF      88.7  68.3   80.5   53.9  67.6   49.4

Table 3: Performance of machine learning classifiers, by category

                        Text            Network
Variable            A    P    R      A    P    R       %
Gender
  Female           88   90   87     86   85   88    48.8
Age
  18-25            85   72   68     82   66   61    26.7
  26-40            74   67   43     72   64   53    36.1
  ≥ 40             74   63   78     73   62   76    37.1
Race/ethnicity
  African Am.      90   75   34     89   89   30    13.4
  Hisp./Latino     88   79   37     86   78   25    17.2
  Asian/Other      98   90   11     98   75    2     1.6
  White            77   77   97     75   74   98    67.6
Party
  Democrat         62   51   55     66   54   53    38.4
  Republican       76   55   22     76   59   25    36.3
  Unaffiliated     61   50   55     58   47   66    25.2
Turnout
  Voted            67   69   88     65   66   91    63.0
Income
  Low              73   50   18     72   48   19    27.5
  Middle           51   46   77     50   45   79    42.7
  High             72   54   31     72   55   27    29.8

A = accuracy; P = precision; R = recall; % = proportion in category.
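The evaluation protocol can be reproduced along these lines (a sketch, assuming `model`, `X`, and `y` are the classifier pipeline and the full training features and labels from the sketches above):

```python
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Out-of-fold predictions from 5-fold cross-validation (as in Table 2).
pred = cross_val_predict(model, X, y, cv=5)
print("cross-validated accuracy:", accuracy_score(y, pred))

# Per-category precision/recall on dichotomized indicators (as in Table 3).
cats = sorted(set(y))
prec, rec, _, _ = precision_recall_fscore_support(y, pred, labels=cats)
for cat, p, r in zip(cats, prec, rec):
    print(f"{cat}: precision={p:.2f} recall={r:.2f}")
```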
When compared with previous studies, the levels of accuracy reported here are comparable to or higher than those previously achieved. For example, Chen et al. (2015) achieve 79% accuracy for ethnicity, 88% for gender, and 67% for age. Al Zamal, Liu, and Ruths (2012) obtain 80% accuracy for age, 80% for gender, and 92% for political orientation. However, note that those results are based on features that are much more costly to obtain, or rely on self-selected samples where good performance is easier to achieve because users are easier to classify.

When comparing the two different methods, a clear pattern emerges: text-based features are as good as or better than network-based measures. The differences are particularly large for age, propensity to vote, and income. While there are differences across these groups in whom they follow (as evidenced by the fact that bag-of-followers features are also good predictors), language traits appear to be more indicative, which is consistent with previous research in computational linguistics. In the case of race, this result is not surprising, given that one of the largest minorities in the sample speaks a language other than English. At the same time, however, this result raises questions about the performance of the classifier across different groups within this ethnic community (e.g. first- vs. second-generation immigrants). In practice, this method identifies members of this community based on their language and, depending on how it is applied, may lead to a problem of representativeness of the predicted sample of Hispanics with respect to the entire population of Hispanics on Twitter. Additional evidence of this limitation of the model is the low recall of some of the classifiers; in other words, many Hispanic Twitter users are not identified as such, probably because they do not tweet in Spanish as often. While this problem is perhaps most obvious in this case, it also appears to apply to other sociodemographic groups, such as Republican supporters.

An alternative way to evaluate the performance of the classifiers is to identify the emoji characters, words, and accounts with the highest and lowest estimated coefficients in the regularized logistic regression. Table 4 reports these sets of features. To facilitate the interpretation, the coefficients in the network model were weighted by the number of followers of each account, in order to make them comparable to the TF-IDF normalization of the emoji- and word-based models.
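With the scikit-learn pipeline sketched in the Method section, the most predictive words for a category can be pulled out along these lines (a minimal sketch, assuming the binary bag-of-words model from above):

```python
import numpy as np

# make_pipeline names its steps after their classes, in lowercase.
vec = model.named_steps["tfidfvectorizer"]
clf = model.named_steps["sgdclassifier"]
features = np.asarray(vec.get_feature_names_out())

# coef_ has one row per class for multi-class models; [0] suffices for binary.
weights = clf.coef_[0]
top = np.argsort(weights)[-20:][::-1]     # 20 largest positive coefficients
print(features[top])                      # cf. the word lists in Table 4
```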
These results have high face validity and are consistent with previous studies of language use in psychology and linguistics; see Schwartz et al. (2013) for a review. For example, females use more emotion words and mention psychological and social processes, whereas males use profane words and object references more often. Regarding age, the results show a pattern of progression through individuals' life cycle: from school and college, to work, and then to family (e.g. some of the most predictive words of being older than 40 relate to children and grandchildren); and from an emphasis on expressing emotions to more action and object references. Another strong sign that the method is correctly classifying individuals' race and ethnicity is that one of the best predictors of each category is the skin-tone modifier, which changes the appearance of face emoji. Regarding party identification, the use of words and emoji related to marriage equality (e.g. the rainbow emoji), reproductive rights ("women"), and skin-tone modifiers is a good predictor of a Twitter user being affiliated with the Democratic party, reflecting the sociodemographic composition of this group. Republicans, on the other hand, appear more likely to discuss their faith on Twitter. Individuals with no party affiliation tend to use words that are unrelated to politics. Although the results are not as strong for the turnout classifier, words such as "vote" and "news" and the check-mark emoji appear among the best predictors of having voted in 2012. Finally, the emoji and words associated with different income levels point to another limitation of this method: many of them refer to geographic locations where home values are generally low or high (e.g. fresno and sacramento vs. san francisco or miami). Still, most of these words indicate that the models are capturing real signal: for instance, tweeting about flights, travel, and activities like golf or skiing is a good predictor of having a high income.

The results for the network-based model are also consistent with previous work and with popular conceptions of the audience of each of these accounts. For gender, just like Culotta et al. (2015), I find that following Ellen DeGeneres is an excellent predictor of a Twitter user being female, whereas following SportsCenter and other famous sports figures is a good indicator of a user being male. Republicans and Democrats also follow accounts that align with their political preferences: Barack Obama, Rachel Maddow, and Bill Clinton for Democrats; Fox News, Mitt Romney, and Tim Tebow for Republicans. African Americans and Hispanics appear likely to follow popular figures in their communities, such as Kevin Hart, Oprah Winfrey, and LeBron James, or Pitbull, Jennifer Lopez, and Shakira. Whites, on the other hand, are more likely to follow country stars like Blake Shelton. Finally, following Miley Cyrus, UberFacts, or Daniel Tosh is a good predictor of being younger than 25, whereas following CNN, Oprah, or Jimmy Fallon is more common among users older than 40.

Table 4: Top predictive features (words, accounts) most associated with each category. (The emoji characters in the original table could not be rendered here and are omitted.)

Female
  Words: love, women, hair, girl, husband, mom, omg, cute, excited, <3, girls, yay, happy, hubby, boyfriend, :), can't, baby, wine, thank, heart, nails...
  Accounts: @TheEllenShow, @khloekardashian, @MileyCyrus, @Starbucks, @jtimberlake, @VictoriasSecret, @WomensHealthMag, @channingtatum...
Male
  Words: bro, man, wife, good, causewereguys, gay, great, dude, f*ck, nice, game, iphone, ni**a, church, time, #gay, girlfriend, bruh, sportscenter...
  Accounts: @SportsCenter, @danieltosh, @MensHealthMag, @AdamSchefter, @ConanOBrien, @KingJames, @katyperry, @ActuallyNPH...
Age: 18-25
  Words: class, college, semester, life, (:, sportscenter, campus, best, literally, like, haha, just, :d, finals, classes, okay, professor, exam, studying...
  Accounts: @SportsCenter, @wizkhalifa, @MileyCyrus, @danieltosh, @instagram, @EmWatson, @KevinHart4real, @UberFacts, @vine...
Age: 26-40
  Words: excited, work, amazing, bar, awesome, wedding, #tbt, pretty, #nofilter, ppl, bday, time, lil, #love, yay, #latergram, office, game, tonight, boo, super...
  Accounts: @danieltosh, @ConanOBrien, @jtimberlake, @StephenAtHome, @chelseahandler, @KimKardashian, @instagram, @NPR, @britneyspears...
Age: ≥ 40
  Words: great, daughter, son, nice, r, good, ok, kids, congratulations, obama, hi, nbcthevoice, wow, happy, hope, beautiful, sorry, rock, grandson, amen...
  Accounts: @jimmyfallon, @cnnbrk, @YouTube, @Pink, @TheEllenShow, @NBCTheVoice, @SteveMartinToGo, @Oprah, @sethmeyers, @FoxNews...
African Am.
  Words: black, smh, #scandal, lol, god, iamsteveharvey, bout, yall, man, ni**a, morning, blessed, wit, y'all, lil, yo, bruh, lord, good...
  Accounts: @BarackObama, @instagram, @KevinHart4real, @Oprah, @KingJames, @stephenasmith, @LilTunechi, @Lakers, @YouTube, @MariahCarey...
Hisp./Latino
  Words: miami, lmao, colombia, que, en, #miami, fiu, lmfao, hola, el, cuban, la, hialeah, hispanic, lol, :d, lmfaooo, tu...
  Accounts: @instagram, @nytimes, @JLo, @ladygaga, @SofiaVergara, @KimKardashian, @shakira, @georgelopez, @justinbieber, @pitbull, @DJPaulyD...
Asian/Other
  Words: i'm, asian, miami, lol, jacksonville, haha, tampa, :d, :), indian, allah, gainesville, like, india, orlando, #heatnation, #tampa, studying...
  Accounts: @TheEllenShow, @cnnbrk, @azizansari, @BarackObama, @DalaiLama, @NBA, @mindykaling, @mashable, @UberFacts, @JLin7...
White
  Words: tonight, sweet, florida, ya, beach, blakeshelton, cat, haha, beer, think, night, asheville, great, baseball, dog, today, sure, lake...
  Accounts: @ActuallyNPH, @TheEllenShow, @blakeshelton, @jimmyfallon, @tomhanks, @danieltosh, @Pink, @FoxNews, @RyanSeacrest...
Democrat
  Words: philly, barackobama, la, sf, pittsburgh, women, nytimes, philadelphia, smh, president, gop, black, hillaryclinton, gay, republicans...
  Accounts: @BarackObama, @rihanna, @maddow, @billclinton, @khloekardashian, @billmaher, @Oprah, @KevinHart4real, @algore, @MichelleObama...
Republican
  Words: foxnews, #tcot, church, christmas, oklahoma, florida, obama, great, realdonaldtrump, golf, beach, megynkelly, tulsa, byu, seanhannity...
  Accounts: @FoxNews, @danieltosh, @TimTebow, @MittRomney, @taylorswift13, @jimmyfallon, @RyanSeacrest, @Starbucks, @JimGaffigan...
Unaffiliated
  Words: ohio, arkansas, columbus, cleveland, cincinnati, utah, toledo, cavs, #wps, browns, ar, akron, hogs, bengals, kent, dayton, #cbj, reds...
  Accounts: @instagram, @SportsCenter, @KingJames, @vine, @AnnaKendrick47, @wizkhalifa, @WhatTheFFacts, @galifianakisz, @ActuallyNPH...
Voted
  Words: great, obama, san, did, kids, vote, cleveland, daughter, nc, news, disneyland, barackobama, church, romney, county, president, california...
  Accounts: @BarackObama, @TheEllenShow, @jimmyfallon, @FoxNews, @azizansari, @blakeshelton, @MittRomney, @Starbucks, @RyanSeacrest...
Did not vote
  Words: college, life, philly, pittsburgh, bro, sportscenter, miss, florida, im, sh*t, penn, ya, f*ck, gonna, guys, can't, man, actually, wanna...
  Accounts: @SportsCenter, @vine, @justinbieber, @wizkhalifa, @MileyCyrus, @UberFacts, @Eminem, @KendallJenner, @Jenna_Marbles...
Income: Low
  Words: fresno, sacramento, bakersfield, work, lol, spokane, watching, good, ass, follow, wwe, #raw, :-), baby, need, im, ready, tired, sleep, bored...
  Accounts: @instagram, @WhiteHouse, @YouTube, @ArianaGrande, @tomhanks, @stephenasmith, @KevinHart4real, @aliciakeys, @carmeloanthony...
Income: Middle
  Words: diego, denver, disneyland, vegas, utah, #sandiego, church, sd, tonight, disney, anaheim, las, colorado, worship, abc7, kings, lakewood, awesome...
  Accounts: @vine, @Usher, @ZooeyDeschanel, @RyanSeacrest, @AdamSchefter, @rihanna, @rainnwilson, @robkardashian, @andersoncooper...
Income: High
  Words: sf, francisco, best, miami, class, san, great, thanks, nyc, la, congrats, beach, data, michigan, college, philly, flight, actually, #sf, nytimes, seattle...
  Accounts: @cnnbrk, @jimmykimmel, @StephenAtHome, @adamlevine, @jimmyfallon, @TechCrunch, @neiltyson, @SteveMartinToGo, @nytimes...

Note: Each row lists the top 15-20 words/accounts that best predict each category, not the most common ones.

One limitation of this approach is that the training dataset is not representative of Twitter users. Since it only contains individuals who report their real names on their profiles and who are registered to vote, it is likely that the dataset over-represents users who are more active and more likely to use Twitter to consume political information. In order to evaluate how the classifiers perform out of sample, I took a random sample of 2,000 Twitter users in the U.S. (see the next section for details on sample selection) and used the crowd-sourcing platform CrowdFlower to code their gender, race/ethnicity, and age based on their names and profile pictures. Table 5 shows that the out-of-sample performance of these models is lower than in-sample, as expected, but still significantly above any baseline classifier.

Table 5: Out-of-sample performance (network classifier).

                  Observed (%)     A     P     R       N
Gender
  Female                    52    75    69    86   1,461
Race/ethnicity
  African Am.               14    89    83    27   1,203
  Hisp./Latino              24    81    76    31   1,203
  Asian/Other                0    NA    NA    NA   1,203
  White                     62    71    69    95   1,203
Age
  18-25                     51    63    66    57   1,329
  26-40                     40    63    56    36   1,329
  ≥ 40                       9    72    17    63   1,329
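Finally, to apply the classifiers to a new user, the required features can be collected with fewer than 5 API calls, as claimed above. A rough sketch (assuming tweepy 4.x against the REST API v1.1 available in the period studied; method names differ in older tweepy versions, and the credentials are placeholders):

```python
import tweepy

# Placeholder credentials: substitute real API keys.
auth = tweepy.OAuth1UserHandler("KEY", "SECRET", "TOKEN", "TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Friends come back up to 5,000 IDs per call; tweets up to 200 per call.
friend_ids = api.get_friend_ids(screen_name="some_user")
tweets = api.user_timeline(screen_name="some_user", count=200)

text = " ".join(t.text for t in tweets)   # input for the bag-of-words model
# Intersecting friend_ids with the verified-account list fills one row of the
# bag-of-followers matrix; both feature sets then feed the trained classifiers.
```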